This document uses unnormalized TSS counts to test for differential usage of TSS’s on the same gene between conditions.
It starts with some general checks on the number of sites and number of regions used per gene.
It continues with a heuristic estimate by correlation analysis: are the levels of alt TSS’s correlated between conditions.
The highlight is the use of a negative binomial count model to assess the differential TSS usage between conditions. This model is explained in detail. We do a multiple comparisons correction at 1% FDR.
All the analysis is done on JEC21 data first. Then on H99 data, starting again from the beginning.
Why? Many negative correlations.
This model focuses only on the relative usage of different TSS’s changing between conditions. It does this by factoring out changes in the gene-level usage, within a statistical model incorporating count noise.
For each gene, we take a negative binomial distribution of counts \(x_{csi}\) at each TSS (site, \(s\)), in each condition (\(t\)), for each replicate \(i\): \[x_{sti} \sim Negbin( \exp( \alpha_{ti} + \beta_{ts} ), \theta )\]
Here the coefficients \(\alpha\) track different total counts between replicates for each gene. The \(\beta\) coefficients track different usage between sites at each condition.
The dispersion parameter \(\theta\) estimates how much more dispersed the counts are relative to a poisson model. A \(NegBin(\mu, \theta)\) variable has mean \(\mu\) and variance \(\mu + \mu^2/\theta\). We use the implementation of negative binomial with log link function from the MASS package’s negative.binomial function, and the glm function in base R to fit.
To estimate the value of \(\theta\) from the whole dataset, we estimate overdispersion of counts from 10 random subsets of single-site genes across conditions.
Later, to control the false discovery rate after testing thousands of genes, we will use the Benjamini-Hochberg criterion applied to the p-values of coefficients.
The data look like this, for a sample gene CNA00250/RPL9, so 12 data points = 2 sites x 2 conditions x 3 replicates.
## # A tibble: 12 x 5
## Gene_name Site Condition Rep Count
## <chr> <fct> <fct> <fct> <dbl>
## 1 CNA00250 1 Expo30 a 4460
## 2 CNA00250 2 Expo30 a 503
## 3 CNA00250 1 Expo30 b 2747
## 4 CNA00250 2 Expo30 b 207
## 5 CNA00250 1 Expo30 c 2271
## 6 CNA00250 2 Expo30 c 143
## 7 CNA00250 1 Expo37 a 1011
## 8 CNA00250 2 Expo37 a 328
## 9 CNA00250 1 Expo37 b 765
## 10 CNA00250 2 Expo37 b 221
## 11 CNA00250 1 Expo37 c 883
## 12 CNA00250 2 Expo37 c 272
The fit to this example data has coefficients (conditions 30, 37 degrees centigrade; replicates a,b,c; sites 1,2). We give the names of the coefficients both in accordance with the model above, and also as automatically assigned by R’s glm function.
## # A tibble: 8 x 5
## coeff term estimate std.error p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a_{30,a} (Intercept) 8.56 0.118 0.000000216
## 2 b_{30,2} Site2 -2.49 0.124 0.0000364
## 3 a_{37,a} - a_{30,a} ConditionExpo37 -1.61 0.169 0.000672
## 4 a_{30,b} - a_{30,a} ConditionExpo30:Repb -0.677 0.149 0.0104
## 5 a_{37,b} - a_{30,a} ConditionExpo37:Repb -0.331 0.153 0.0957
## 6 a_{30,c} - a_{30,a} ConditionExpo30:Repc -0.932 0.151 0.00346
## 7 a_{37,c} - a_{30,a} ConditionExpo37:Repc -0.160 0.151 0.351
## 8 b_{37,2} - b_{30,2} ConditionExpo37:Site2 1.31 0.176 0.00174
The key question for differential alt usage is if \[ \beta_{37,2} - \beta_{30,2} \] i.e. ConditionExpo37:Site2, is significant and substantial. For this gene, the difference is big, as we can see from the graph, and significant as we see from the p value of 0.0017.
The fit has 8 coefficients for 12 data points, so 4 degrees of freedom:
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1072. 11 -66.5 149. 153. 6.80 4
This suggests that most alt sites are either very differentially used or very similarly used; there’s not much middle ground.
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 855 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNA08250 ConditionExpo37:Site2 170096 0.771 0.0000676
## 2 CNN00330 ConditionExpo37:Site2 93619 3.20 0.0000426
## 3 CNK02240 ConditionExpo37:Site2 72448 0.423 0.000217
## 4 CNA04340 ConditionExpo37:Site2 54224 1.50 0.00000913
## 5 CNA04920 ConditionExpo37:Site2 45005 0.667 0.000714
## 6 CNK02330 ConditionExpo37:Site2 44647 0.825 0.00000870
## 7 CNB02670 ConditionExpo37:Site2 42510 1.86 0.000000897
## 8 CNB02670 ConditionExpo37:Site3 42510 0.608 0.000457
## 9 CNC06180 ConditionExpo37:Site2 31787 1.47 0.000641
## 10 CNK03330 ConditionExpo37:Site2 27867 1.86 0.0000349
## # ... with 845 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 1,143 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNA08250 ConditionStat30:Site2 170096 1.54 0.0000379
## 2 CNF04420 ConditionStat30:Site2 138837 2.23 0.0000484
## 3 CNJ00950 ConditionStat30:Site2 78481 0.799 0.000000220
## 4 CNJ00950 ConditionStat30:Site3 78481 0.708 0.000000566
## 5 CNJ00950 ConditionStat30:Site4 78481 0.432 0.0000716
## 6 CNM02200 ConditionStat30:Site2 69744 1.55 0.0000342
## 7 CNA04340 ConditionStat30:Site2 54224 0.667 0.000553
## 8 CNK02330 ConditionStat30:Site2 44647 0.778 0.000140
## 9 CNM02380 ConditionStat30:Site2 40807 0.912 0.0000309
## 10 CNC04820 ConditionStat30:Site2 33913 0.968 0.000799
## # ... with 1,133 more rows
## # A tibble: 1,881 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNJ00410 ConditionStat37:Site2 51084 0.977 0.000343
## 2 CNL05140 ConditionStat37:Site2 45132 0.307 0.00167
## 3 CNK02330 ConditionStat37:Site2 44647 0.905 0.000401
## 4 CNB02670 ConditionStat37:Site2 42510 0.534 0.00119
## 5 CNA00170 ConditionStat37:Site2 39848 0.734 0.000956
## 6 CNC06180 ConditionStat37:Site2 31787 1.52 0.000311
## 7 CNA06460 ConditionStat37:Site2 31648 0.477 0.0000365
## 8 CNL04750 ConditionStat37:Site2 24597 0.745 0.000438
## 9 CNC02870 ConditionStat37:Site2 24045 0.474 0.0000336
## 10 CNM01710 ConditionStat37:Site2 22342 0.831 0.0000411
## # ... with 1,871 more rows
## # A tibble: 1,947 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNA08250 ConditionStat37:Site2 170096 2.28 0.000000837
## 2 CNF04420 ConditionStat37:Site2 138837 2.34 0.0000308
## 3 CNN00330 ConditionStat37:Site2 93619 3.36 0.000169
## 4 CNK02310 ConditionStat37:Site2 84186 1.03 0.000232
## 5 CNM02200 ConditionStat37:Site2 69744 0.993 0.000106
## 6 CNA04340 ConditionStat37:Site2 54224 1.51 0.000286
## 7 CND04780 ConditionStat37:Site2 48791 1.05 0.000211
## 8 CNF02040 ConditionStat37:Site2 45927 0.602 0.00156
## 9 CND01440 ConditionStat37:Site2 45905 0.617 0.00114
## 10 CNK02330 ConditionStat37:Site2 44647 0.699 0.000755
## # ... with 1,937 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 737 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNA08250 Conditionupf1:Site2 170096 0.988 0.0000858
## 2 CNB02670 Conditionupf1:Site3 42510 0.331 0.000630
## 3 CNM02380 Conditionupf1:Site2 40807 0.985 0.000207
## 4 CNA05220 Conditionupf1:Site2 29490 1.48 0.000142
## 5 CNK03330 Conditionupf1:Site2 27867 0.820 0.000661
## 6 CNL04750 Conditionupf1:Site3 24597 1.29 0.0000748
## 7 CNG03220 Conditionupf1:Site3 24505 0.559 0.000204
## 8 CNG01250 Conditionupf1:Site3 23108 1.43 0.000000106
## 9 CNB02690 Conditionupf1:Site2 22186 0.757 0.000100
## 10 CNN00420 Conditionupf1:Site2 22026 10.9 0.0000756
## # ... with 727 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 729 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNN00330 Conditionupf1:Site2 93619 2.46 0.000108
## 2 CNA04340 Conditionupf1:Site2 54224 0.988 0.000252
## 3 CNB02670 Conditionupf1:Site2 42510 0.719 0.000337
## 4 CNB02670 Conditionupf1:Site3 42510 0.867 0.000163
## 5 CNA00250 Conditionupf1:Site2 32405 0.634 0.000176
## 6 CNA06460 Conditionupf1:Site2 31648 0.698 0.000250
## 7 CNF00330 Conditionupf1:Site2 25959 1.28 0.000654
## 8 CNG03220 Conditionupf1:Site3 24505 0.474 0.000105
## 9 CNC02870 Conditionupf1:Site2 24045 0.817 0.0000216
## 10 CNJ03160 Conditionupf1:Site2 23055 21.7 0.000000970
## # ... with 719 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 1,162 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNG00830 Conditionupf1:Site2 175616 0.717 4.58e- 4
## 2 CNA08250 Conditionupf1:Site2 170096 1.68 6.34e- 5
## 3 CNF04420 Conditionupf1:Site2 138837 2.34 3.57e- 5
## 4 CNJ00950 Conditionupf1:Site2 78481 0.449 3.20e-10
## 5 CNJ00950 Conditionupf1:Site3 78481 0.508 6.31e-11
## 6 CNM02200 Conditionupf1:Site2 69744 1.70 3.94e- 6
## 7 CNA04360 Conditionupf1:Site2 51497 0.380 6.30e- 4
## 8 CNL05140 Conditionupf1:Site2 45132 0.462 2.54e- 4
## 9 CNM02380 Conditionupf1:Site2 40807 0.880 9.74e- 4
## 10 CNC06180 Conditionupf1:Site2 31787 1.03 4.55e- 4
## # ... with 1,152 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 822 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CND04780 ConditionExpo37:Site2 48791 2.12 0.000238
## 2 CNB02670 ConditionExpo37:Site2 42510 1.87 0.00000518
## 3 CNB02670 ConditionExpo37:Site3 42510 0.590 0.0000512
## 4 CNA00170 ConditionExpo37:Site2 39848 0.761 0.000319
## 5 CNK03330 ConditionExpo37:Site2 27867 1.11 0.0000384
## 6 CNG03220 ConditionExpo37:Site3 24505 0.601 0.0000731
## 7 CNC02870 ConditionExpo37:Site2 24045 0.476 0.000449
## 8 CNN00420 ConditionExpo37:Site2 22026 10.9 0.0000756
## 9 CNK00060 ConditionExpo37:Site2 21117 1.99 0.00000848
## 10 CNC05790 ConditionExpo37:Site2 20518 1.55 0.0000203
## # ... with 812 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 1,118 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNA08250 ConditionStat30:Site2 170096 0.841 0.000375
## 2 CNK02310 ConditionStat30:Site2 84186 0.983 0.000127
## 3 CNJ00950 ConditionStat30:Site4 78481 0.519 0.0000288
## 4 CNL04650 ConditionStat30:Site2 78263 1.77 0.00000731
## 5 CND01200 ConditionStat30:Site2 64717 0.741 0.000341
## 6 CNB01800 ConditionStat30:Site2 64334 0.685 0.000535
## 7 CNA04360 ConditionStat30:Site2 51497 0.767 0.000352
## 8 CNF02040 ConditionStat30:Site2 45927 0.603 0.0000782
## 9 CNB02670 ConditionStat30:Site2 42510 1.15 0.000107
## 10 CNG00970 ConditionStat30:Site2 41577 0.912 0.0000257
## # ... with 1,108 more rows
There are many negative correlations.
Some code that I used to test and set up the models is here
The code in the next chunk showed that fitting a single model across many hundreds/thousands of genes would be prohibitively slow to fit. In principle it would be possible to recode as a hierarchical model with information pooled between genes (e.g. dispersion \(\theta\)). However, it was more efficient to estimate \(\theta\) from some subsamples, and then fit each gene separately, the approach we took in the end.
This is the model we actually used, as explained above under JEC21.
Negative binomial is an extension of Poisson model with extra parameter theta; NegBin(mu, theta) has mean mu and variance mu + mu^2/theta.
This suggests that most alt sites are either very differentially used or very similarly used; there’s not much middle ground.
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 797 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNAG_01990 ConditionExpo37:Site2 39447 0.426 0.0000733
## 2 CNAG_05774 ConditionExpo37:Site2 27812 0.401 0.000233
## 3 CNAG_04585 ConditionExpo37:Site2 23225 0.919 0.0000481
## 4 CNAG_01230 ConditionExpo37:Site3 20703 0.881 0.000151
## 5 CNAG_00995 ConditionExpo37:Site2 20613 1.94 0.000451
## 6 CNAG_03492 ConditionExpo37:Site2 19717 2.12 0.000669
## 7 CNAG_02527 ConditionExpo37:Site2 15437 0.956 0.000373
## 8 CNAG_02527 ConditionExpo37:Site3 15437 0.849 0.000637
## 9 CNAG_01043 ConditionExpo37:Site2 14653 0.516 0.000163
## 10 CNAG_06120 ConditionExpo37:Site2 13480 1.45 0.000000239
## # ... with 787 more rows
Coloured for significance at 1% FDR, Benjamini-Hochberg.
## # A tibble: 721 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNAG_04388 ConditionStat30:Site2 23624 0.864 0.000597
## 2 CNAG_01019 ConditionStat30:Site2 20799 1.98 0.000000624
## 3 CNAG_01230 ConditionStat30:Site3 20703 0.921 0.000131
## 4 CNAG_02100 ConditionStat30:Site2 17802 0.215 0.0000950
## 5 CNAG_00457 ConditionStat30:Site2 17646 0.796 0.0000436
## 6 CNAG_05980 ConditionStat30:Site2 16853 13.6 0.000000389
## 7 CNAG_02880 ConditionStat30:Site2 16214 0.478 0.000449
## 8 CNAG_06120 ConditionStat30:Site2 13480 1.80 0.0000534
## 9 CNAG_06120 ConditionStat30:Site3 13480 1.63 0.0000115
## 10 CNAG_02500 ConditionStat30:Site2 13357 0.573 0.000544
## # ... with 711 more rows
- CNAG_04245, chitinase.
- CNAG_06817, UAP cation symporter
## # A tibble: 1,713 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNAG_06125 ConditionStat37:Site2 277167 0.269 0.00157
## 2 CNAG_06125 ConditionStat37:Site3 277167 0.280 0.00116
## 3 CNAG_01884 ConditionStat37:Site2 51358 0.810 0.000617
## 4 CNAG_06238 ConditionStat37:Site2 50468 0.452 0.000358
## 5 CNAG_01486 ConditionStat37:Site2 45953 0.388 0.000162
## 6 CNAG_01486 ConditionStat37:Site3 45953 0.662 0.00000715
## 7 CNAG_01990 ConditionStat37:Site2 39447 0.702 0.000344
## 8 CNAG_04659 ConditionStat37:Site2 35362 0.334 0.00188
## 9 CNAG_06113 ConditionStat37:Site2 33203 0.317 0.000179
## 10 CNAG_05521 ConditionStat37:Site2 29217 0.366 0.000249
## # ... with 1,703 more rows
## # A tibble: 1,658 x 5
## Gene_name term Tot log10abs p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CNAG_04004 ConditionStat37:Site2 106287 0.539 0.000591
## 2 CNAG_04794 ConditionStat37:Site2 94099 0.959 0.00000908
## 3 CNAG_01884 ConditionStat37:Site2 51358 0.792 0.0000614
## 4 CNAG_06238 ConditionStat37:Site2 50468 0.459 0.00158
## 5 CNAG_06541 ConditionStat37:Site2 49505 0.450 0.000173
## 6 CNAG_01486 ConditionStat37:Site3 45953 0.534 0.0000203
## 7 CNAG_01990 ConditionStat37:Site2 39447 0.526 0.0000660
## 8 CNAG_00091 ConditionStat37:Site2 37223 0.501 0.00115
## 9 CNAG_06113 ConditionStat37:Site2 33203 0.531 0.000731
## 10 CNAG_04021 ConditionStat37:Site2 33106 0.576 0.00163
## # ... with 1,648 more rows